Smart Paper Notes

Fine-grained Czech News Article Dataset: An Interdisciplinary Approach to Trustworthiness Analysis

Matyáš Boháček , Michal Bravanský , Filip Trhlík , Václav Moravec

Category: Natural Language Processing

2022-12-16

We present the Verifee Dataset: a novel dataset of news articles with fine-grained trustworthiness annotations. We develop a detailed methodology that assesses the texts based on parameters encompassing editorial transparency, journalistic conventions, and objective reporting, while penalizing manipulative techniques. We bring aboard a diverse set of researchers from the social, media, and computer sciences to overcome the barriers and limited framing of this interdisciplinary problem. We collect over $10,000$ unique articles from almost $60$ Czech online news sources. These are categorized into one of the $4$ classes across the credibility spectrum we propose, ranging from entirely trustworthy articles all the way to manipulative ones. We produce detailed statistics and study trends emerging throughout the set. Lastly, we fine-tune multiple popular sequence-to-sequence language models on the trustworthiness classification task using our dataset and report a best test F-1 score of $0.52$. We open-source the dataset, annotation methodology, and annotators' instructions in full at https://verifee.ai/research to enable follow-up work. We believe similar methods can help prevent disinformation and support media literacy education.

Translated by Google Translate

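The 2022 Russian invasion of Ukraine is being fought on two fronts: a brutal ground war and a duplicitous disinformation campaign designed to conceal and justify Russia's actions. This campaign includes at least one example purportedly showing Ukrainian President Zelenskyy admitting defeat and surrendering. In anticipation of future attacks of this form, we describe a facial and gestural behavioral model that captures distinctive characteristics of Zelenskyy's speaking style. Trained on over eight hours of authentic video, this behavioral model can, as we show, distinguish Zelenskyy from deep-fake imposters. Such a model can play an important role, particularly in the fog of war, in distinguishing the real from the fake.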


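Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. Although machine learning problems related to document information extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect the domain-specific aspects and practical needs of automating B2B document communication. We review the landscape of document IE problems, datasets, and benchmarks. We highlight the practical aspects missing from the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. Relevant datasets and benchmarks for document IE on semi-structured business documents are lacking because their content is typically legally protected or sensitive. We discuss potential sources of available documents, including synthetic data.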
